[Bug][TMA] Drop buffer_ob gate from 1D bulk-copy eligibility#2235
[Bug][TMA] Drop buffer_ob gate from 1D bulk-copy eligibility#2235mygitljf wants to merge 1 commit into
Conversation
When the global tensor has a dynamic outer shape and the index cannot be statically proven within bounds, `buffer_oob` is set to true. The existing logic in `AnalyzeCopyFacts` then masked out 1D bulk-copy eligibility along with the descriptor-based 2D path, causing `InferLayout` to install a swizzle-shaped shared layout via `InferBulkLayout` and forcing `Copy::Lower` to fall through to the 2D `LowerBulk`. There the `inner_box_dim > 256` branch issues `(K / 256)` separate `tl::tma_load` calls instead of a single 1D bulk copy. The 1D bulk-copy path emits `cp.async.bulk` (not `cp.async.bulk.tensor`) and has the same OOB semantics as a plain `T.copy()` - it does not need the descriptor-only OOB gate. Drop `!ctx.buffer_oob` from the 1D eligibility check so dynamic-outer-shape 1D copies that already satisfy `CheckBulkCopy1D` (contiguous innermost slice, full-extent trailing dim, element count match) keep the single-issue path. Repro: see issue tile-ai#2180 - `T.tma_copy(A[var, 0:K], a_shared, barrier=mbar)` with `M=T.dynamic('M')`, `K=T.const('K')` on `A[M, K]` was lowering to 4 split `tl::tma_load` calls; with this patch it lowers to a single `tl::tma_load(smem, gmem, mbar, total_bytes)` and no `CUtensorMap` descriptor. Verified locally on A100 at the codegen-string level (end-to-end launch requires sm_90+ and is left to CI). Tests: 3 new regression cases in `testing/python/language/test_tilelang_language_tma_1d.py` covering fp32/K=1024, fp32/K=512, fp16/K=1024 with `T.dynamic` outer shape. All existing TMA tests pass without regression.
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughFixed TMA availability gating in ChangesIssue
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Fixes #2180.
When i was looking at the issue, the 4-way split came from the 1D bulk-copy eligibility being gated on the same flag the descriptor-based 2D path uses, even though the 1D path doesn't actually need that gate.
I narrowed it to a single safety check that only matters for the descriptor variant and then added 3 regression tests covering fp32/K=1024, fp32/K=512 and fp16/K=1024 with a
T.dynamicouter shape.Verification
Pre-patch generated source for the issue's repro (extracted from the codegen string):
Post-patch:
A single 1D cp.async.bulk issue, no CUtensorMap descriptor — matches the expected behavior in the issue.